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Abstract 

Background: Sets of genes that are known to be associated with each other can be used to interpret microarray 
data. This gene set approach to microarray data analysis can illustrate patterns of gene expression which may be 
more informative than analyzing the expression of individual genes. Various statistical approaches exist for the 
analysis of gene sets. There are three main classes of these methods: over-representation analysis, functional class 
scoring, and pathway topology based methods. 

Methods: We propose weighted hypergeometric and weighted chi-squared methods in order to assign a rank to 
the degree to which each gene participates in the enrichment. Each gene is assigned a weight determined by the 
absolute value of its log fold change, which is then raised to a certain power. The power value can be adjusted as 
needed. Datasets from the Gene Expression Omnibus are used to test the method. The significantly enriched 
pathways are validated through searching the literature in order to determine their relevance to the dataset. 

Results: Although these methods detect fewer significantly enriched pathways, they can potentially produce more 
relevant results. Furthermore, we compare the results of different enrichment methods on a set of microarray 
studies all containing data from various rodent neuropathic pain models. 

Discussion: Our method is able to produce more consistent results than other methods when evaluated on 
similar datasets. It can also potentially detect relevant pathways that are not identified by the standard methods. 
However, the lack of biological ground truth makes validating the method difficult. 



Introduction 

Due to their ability to provide comprehensive snapshots of 
cellular activity, microarrays have become a widely utilized 
tool in bio-medical sciences. Microarray-based gene 
expression detection has been used for biomarker discov- 
ery as well as diagnostic and prognostic purposes [1-4]. 
Online microarray experiment repositories such as Gene 
Expression Omnibus (GEO) [5,6], ArrayExpress [7], and 
Stanford Microarray Database (SMD) [8] are invaluable 
resources containing gene expression profiles that span 
multiple developmental stages, experimental conditions, 
and model organisms [9,10]. There are numerous chal- 
lenges presented by the expanding availability of microar- 
ray data. The difficulty of interpreting the lists of 
significant genes produced by microarray experiments is a 



major challenge. The staggering number and diversity of 
the differentially expressed genes can be hard to interpret 
in a biologically meaningful way. As a result several statis- 
tical methods for gene set enrichment have been devel- 
oped. The set of differentially expressed genes is compared 
to gene sets from various databases including Gene Ontol- 
ogy (GO) [11] or the Kyoto Encyclopedia of Genes and 
Genomes (KEGG) [12]. 

Huang et al. reviewed and classified 68 available tools for 
the statistical analysis of gene sets [13]. Huang classified 
the available pathway enrichment methods into three cate- 
gories: over-representation analysis (ORA), functional class 
scoring (FCS), and pathway topology (PT) based methods 
[13,14]. In ORA, a list of genes is compiled by selecting 
genes based on their significance, fold change, or both. 
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ORA techniques seek to identify whether the gene list is 
over-represented in a gene set or pathway. In ORA 
approaches, if k genes from the list are found in a pathway 
then the probability of finding k or more genes is calcu- 
lated. The resulting p-values are used to determine whether 
or not a pathway or gene set is significantly enriched. The 
probability can be calculated using the chi-squared distribu- 
tion, Fisher's Exact Test, the binomial probability distribu- 
tion, or the hypergeometric distribution [13]. 

In functional class scoring approaches, such as Gene Set 
Enrichment Analysis (GSEA) [15], all genes are considered 
when calculating enrichment instead of a pre-selected list 
[14,15]. This can deliver improved statistical power [13]. 
In the FCS approaches, genes are assigned ranks. In 
GSEA, a gene's rank is determined by its correlation with 
the experimental sample classifications. When calculating 
the significance of a gene set, the null hypothesis is that 
the genes in a set are randomly distributed throughout the 
ranked list of genes from the microarray experiment. 
GSEA creates a null distribution by randomly permuting 
the labels of the samples and producing lists of genes 
ranked by their correlation with the newly shuffled sample 
labels. Using this null distribution to estimate the signifi- 
cance is analogous to a weighted Kolmogorov-Smirnov- 
like statistic [15]. In contrast, Parametric Analysis of Gene 
Set Enrichment (PAGE) determines a z-score for a set and 
uses normal distribution to determine significance [16]. 

Both ORA and FCS approaches ignore the connections 
between genes in a pathway, however PT-based approaches 
integrate the information contain in the edges of a pathway 
when determining the enrichment. The disadvantage of 
PT-based approaches is that they cannot be applied to the 
Gene Ontology [14]. ScorePAGE computes a score that 
represents the similarity between pairs of genes, and then 
divides this score by the number of edges between the two 
genes [14,17]. Another approach is that of Signaling Path- 
way Impact Analysis (SPIA), which computes a "perturba- 
tion factor" for each gene in a pathway. This is given by 
the change in expression of the gene and by a linear func- 
tion of the perturbation factors of all the other genes in the 
pathway. The "impact factor" of the pathway is a statistic 
calculated by taking the sum of the perturbation factors of 
the genes in the pathway [14,18]. 

We have previously proposed a method for enriching 
gene sets that is a hybrid of over-representation and 
functional class scoring [19]. Our method requires the 
contribution of all the genes in the dataset. Each gene 
contributes to the enrichment score in proportion to its 
fold change. Like ORA we calculate significance using 
the hypergeometric or chi-square distribution; however 
our method weighs the probability calculation by the fold 
change of the genes. In our method each gene is assigned 
a score based on its fold change, and we create a pseudo 
pathway, which is proportionally larger than the original 



pathway. We then calculate the significance of sampling 
the sum of the scores of the genes from the larger pseudo 
pathway. 

We applied this pathway enrichment methodology in 
order to perform a meta-analysis of rodent neuropathic 
pain microarray experiments. Neuropathic pain is a 
chronic condition resulting from damage to any part of 
the nervous system or from diseases affecting an area of 
the nervous system. Neuropathic pain is typically accom- 
panied by inflammation [20] and sensory and motor 
dysfunction [21]. Up to eight percent of the general popu- 
lation is affected by neuropathic pain [22,23]. While there 
is no clear etiology for neuropathic pain, spinal cord 
injury, diabetes, alcoholism, chemotherapy, chronic viral 
infection, transverse myelitis, and strokes are common 
causes. Due to the complex etiology and symptoms of 
neuropathic pain and its poorly understood mechanisms, 
the classification of chronic pain syndromes has remained 
largely subjective. Common treatments are able to pro- 
duce better than moderate pain relief in only one third of 
patients [24] . Treatments such as opiates, tricyclic antide- 
pressants, anti-convulsants, anti-epileptics, topical analge- 
sics, and NMDA-antagonists are used despite their limited 
efficacy and harmful side-effects [25,26]. There are several 
rodent models of neuropathic pain such as nerve ligation, 
chronic constriction, and spared nerve injury [27,28]. 

Methods 

Gene ID mapping 

Before we could begin performing enrichment analysis, we 
needed to construct a back-end database containing rele- 
vant information from various databases. Towards this 
end, we stored all the KEGG pathways and the genes 
involved in each pathway in a database. We further cre- 
ated a database to map the correspondence of Entrez 
genes with Affymetrix probe identifiers to enable gene 
identifier conversion. The correspondence between Entrez 
gene identifiers and KEGG gene identifiers was also 
mapped. Ultimately, a database for the KEGG pathway 
information and a database for gene identifier conversion 
were created in SQLite. Only genes from Homo Sapiens, 
Rattus Norvegicus, and Mus Musculus were included in 
the database. We mapped Affymetrix gene identifiers to 
Entrez Gene identifiers for Affymetrix microarray datasets 
with binary classifications obtained from the Gene Expres- 
sion Omnibus (GEO) [6]. Because multiple Affymetrix 
probes can map to a single Entrez Gene, we took the 
mean of the fold-change of the corresponding probes and 
the minimum of their p-values. 

Enrichment 

Before calculating enrichment, we quantile-normalized 
the raw data and computed the fold change of each 
gene. A two-tailed Student's t-test with an alpha value 



Qureshi and Sacan BMC Systems Biology 2013, 7(Suppl 4):S10 
http://www.biomedcentral.eom/1 752-0509/7/S4/S1 0 



Page 3 of 1 1 



of 0.01 was used to identify significant differentially 
expressed genes. The standard hypergeometric test was 
used to perform enrichment for comparison to our 
method. The probability of finding X >k significant 
genes in a particular gene set or pathway is calculated 
as follows: 



P(X>k) = l-J2 



N - 
n ■ 



r ) 



r=0 



(1) 



where N is the number of genes on the array, m is the 
number of significant genes, n is the number of genes in 
the particular KEGG pathway, and k is the number of 
genes that are both significant and present in the parti- 
cular KEGG pathway. Thus we were able to calculate 
the significance of the enrichment of the KEGG path- 
ways and rank them by their significance. 

For our weighted hypergeometric and chi-squared 
tests, each gene was assigned a score calculated as 
shown in the formula below: 



& = \l°g2 (fold change (genei))\ a 



(2) 



The power, a, is an adjustable parameter. In each 
dataset, a value Q was calculated by taking the maxi- 
mum of the gene scores in the pathway. The hypergeo- 
metric distribution is a discrete probability distribution 
function; however our gene scores existed on a continu- 
ous scale. Thus, we had to ensure that our gene scores 
were discrete. Rounding the values of the gene scores to 
the nearest whole number accomplished this. For each 
KEGG pathway, we calculated k by taking the sum of 
the scores of the genes involved in the pathway, as 
shown in the formula below: 



(3) 



where n is the number of genes in a particular KEGG 
pathway. Each individual gene's score g it corresponded 
to the number of copies of that gene that were consid- 
ered significant in the pseudo pathway. The value k cor- 
responds to the total number of significant genes in the 
pseudo pathway. We then utilized the hypergeometric 
distribution to calculate the probability that the pathway 
score was greater than k, according to the formula 
below 



P(X>k) = l-J2 
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where all variables represent the same quantities that 
they do in equation 1, and all quantities are rounded to 
the nearest whole number. We ranked the pathways 
using this p-value. 

A similar approach was applied to the chi-squared statis- 
tic. The chi-squared distribution represents an approxima- 
tion of the exact probability of sampling without 
replacement, which is determined by the hypergeometric 
distribution. The chi-squared statistic was sometimes used 
because of the difficulty of computing hypergeometric 
probabilities for large populations. The chi-squared sta- 
tistic [29] is determined using the 2x2 table shown in 
Table 1. The values from Table 1 are used in the equation 
shown below 



N{\n n n 22 - n u n 21 \) 
N lr N 2r N lc N 2c 



(5) 



We utilize a chi-squared distribution with 1 degree of 
freedom, which is calculated from Table 1 as follows: 



df = (r-\)(c-\) 



(6) 



where r is the number of rows in the table and c is the 
number of columns. We compute a weighted chi-squared 
statistic by constructing a table similar to Table 1, but in 
the place of the significant genes column we use the path- 
way score calculated by Equation 5 and the sum of the 
scores of all the genes on the array. Unlike the hypergeo- 
metric probability distribution, the chi-squared probability 
distribution is continuous. We did not need to discretize 
our data. 

Experiments and results 

We tested these weighted enrichment approaches using a 
microarray dataset from that that compares C. Pneumo- 
niae infected dendritic cells and mock-infected controls 
[30]. The authors of the original dataset did not conduct 
enrichment analysis during their study. The enriched path- 
ways resulting from standard hypergeometric enrichment 
were compared to the enriched pathways resulting from 
weighted hypergeometric and chi-squared enrichment. 
The top-10 most significant pathways detected by hyper- 
geometric enrichment are shown in Table 2. Table 3 

Table 1 The 2x2 table used to calculate the chi-squared 
statistic. 





Genes on 
Array 


Significant 
Genes 




In Pathway 


nn 


n 12 


Nn = n„ + n 12 


Not in 
Pathway 


n 21 


n 22 


N 2r = n 21 + n 22 




Nic = n,,+n 21 


N 2c = n 12 +n 22 


N = n n +n q2 +n 21 
+n 22 
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Table 2 The results of hypergeometric enrichment of the 
genes that are significant at the 0 



Pathway 


p-value 


FDR 


# of significant 
genes 


Nitrogen metabolism 


0.000329 


0.080509 


4 


Biotin metabolism 


0.000815 


0.099656 


1 


Prion diseases 


0.002291 


0.186713 


4 


Natural killer cell mediated 


0.002489 


0.152139 


9 


cytotoxicity 








Gap junction 


0.002548 


0.124621 


7 


Cytokine-cytokine receptor 


0.002632 


0.107276 


15 


interaction 








ErbB signaling pathway 


0.002747 


0.095982 


7 


Osteoclast differentiation 


0.002976 


0.090963 


9 


Non-small cell lung cancer 


0.003207 


0.087127 


5 


Vibrio cholerae infection 


0.00428 


0.104663 


5 



shows the top-10 pathways enriched by the weighted 
hypergeometric method. Table 4 shows the top-10 path- 
ways produced by weighted chi-squared enrichment. 

Hypergeometric enrichment detected 56 significant 
pathways (p < 0.05). Table 3 shows that weighted hyper- 
geometric enrichment only detected four significant 
pathway (p < 0.05). Weighted chi-squared enrichment 
only detected one significant pathway, as shown in 
Table 4. This significant pathway, which contained only 
2 significant genes, was ranked 132 by hypergeometric 
enrichment and had a p-value of 0.23. The two most 
significant pathways according to weighted hypergeo- 
metric enrichment were both glycosphingolipid bio- 
synthesis pathways, which were not detected by the 
standard method because none of the genes in the path- 
way were significant despite high fold-change. Despite 
having no significant genes (p < 0.01), the mean expres- 
sion change of the genes in the glycosphingolipid bio- 
synthesis-globo series pathway was over 3-fold; specific 
values for significance and fold-change for the genes in 

Table 3 The results of weighted hypergeometric 
enrichment of the C.Pneumonia infection dataset 



Pathway P-Value FDR Pathway 



Score 


Glycosphingolipid biosynthesis - globo 


0.011041 1 


22 


series 






Glycosphingolipid biosynthesis - ganglio 


0.01264 1 


23 


series 






Glycosaminoglycan degradation 


0.019259 1 


26 


Pantothenate and CoA biosynthesis 


0.030492 1 


24 


D-Arginine and D-ornithine metabolism 


0.056692 1 


2 


Protein export 


0.063533 1 


26 


Vitamin digestion and absorption 


0.075501 1 


29 


Thiamine metabolism 


0.101094 1 


6 


Primary bile acid biosynthesis 


0.108834 1 


19 


Ether lipid metabolism 


0.114348 1 


40 



Table 4 The results of weighted chi-squared enrichment 
of the C. Pneumonia infection dataset 



Pathway 


P-Value 


FDR Pathway 
Score 


Drug metabolism - cytochrome P450 


0.034493 


1 35.97864 


Amyotrophic lateral sclerosis (ALS) 


0.076237 


1 34.94194 


Pathways in cancer 


0.081381 


1 268.3321 


Cell adhesion molecules (CAMs) 


0.102061 


1 97.041 76 


Calcium signaling pathway 


0.114357 


1 140.0101 


NOD-like receptor signaling pathway 


0.131509 


1 39.68646 


Glycosphingolipid biosynthesis - ganglio 
series 


0.141836 


1 2347024 


Glyoxylate and dicarboxylate 
metabolism 


0.14306 


1 9.819872 


Glycosphingolipid biosynthesis - globo 
series 


0.145623 


1 22.14654 


Axon guidance 


0.173427 


1 102.641 



this pathway are shown in Table 5. The glycosphingoli- 
pid biosynthesis-globo series pathway [12] is shown in 
Figure 1. Additionally, there exist both highly upregu- 
lated and downregulated genes in the pathway. 

We performed a literature search to assess the rele- 
vance of the top-10 pathways produced by the various 
methods. Since the dataset utilized involved the infection 
of cells, pathways related to the immune system should 
be enriched. Hypergeometric enrichment identified two 
potentially relevant pathways: natural killer cell mediated 
cytotoxicity and V. Cholerae infection. The top-2 path- 
ways produced by this method were nitrogen metabolism 
and biotin metabolism. Weighted hypergeometric enrich- 
ment identified both types of glycosphingolipid biosynth- 
esis as the top-2 pathways, with a glycosaminoglycan 
degradation related pathway as the third ranked pathway. 
Glycans and glycosylation are essential components of 



Table 5 The p-values and fold changes of the genes in 
the glycosphingolipid biosynthesis-globo pathway 



Entrez Gene ID 


Symbol 


P-Value 


Fold Change 


2523 


Futl 


0.154618 


1 .079052 


2524 


FUT2 


0.085096 


1 .506772 


2717 


Gla 


0.048796 


7.793774 


3073 


HexA 


0.271048 


5.073008 


3074 


Hexb 


0.166863 


7.790576 


4668 


nagA 


0.063709 


2.171953 


6482 


ST3GAL1 


0.275852 


2.535963 


6483 


ST3GAL2 


0.13051 


2.067693 


6489 


5T8SIA1 


0.357908 


7.755918 


8706 


B3galnt1 


0.029243 


0.863497 


10317 


B3galt5 


0.475799 


3.853211 


10690 


fut9 


0.275609 


1.418501 


26301 


Gbgtl 


0.338454 


0.097302 


53947 


A4GALT 


0.621914 


0.619847 
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GLYCOSPHINGOLIPID BIOSYNTHESIS - GLOBOSERIES 
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Figure 1 The glycosphingolipid biosynthesis-globo series KEGG pathway, with Entrez Genes detected on the array colored pink; this was the 
top-ranked pathway by weighted hypergeometric enrichment. 



the antigen-presenting function of dendritic cells [31]. 
Glycosphingolipids are proteins present in the plasma 
membrane that are known to be involved in immune 
function. They can act as cell-surface antigens [32,33]. 



The standard method failed to detect these pathways; 
where as the weighted hypergeometric method uncov- 
ered the action of these pathways and helped elucidate 
mechanisms of the infection of the cells. In addition, 
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despite finding fewer significant pathways the weighted 
chi-squared method also detected the glycosphingolipid 
synthesis pathways among its top- 10 pathways, although 
at lower ranks than weighted hypergeometric enrich- 
ment. These pathways, which are the top-ranked path- 
ways by the weighted hypergeometric method, are more 
biologically relevant than the top-ranked pathways gener- 
ated by the standard hypergeometric method. 

We further identified 4 datasets from the Gene Expres- 
sion Omnibus pertaining to rodent models of neuropathic 
pain. The datasets included studies utilizing spinal nerve 
ligation, sciatic nerve ligation, chronic constriction injury, 
and spared nerve injury neuropathic pain models 
[25,26,34,35]. Although only 4 studies were utilized, some 
studies contained multiple neuropathic pain models, so we 
examined differential gene expression across 5 different 
conditions. Hypergeometric, chi-squared, weighted hyper- 
geometric, and weighted chi-squared enrichment were 
applied to each of the dataset. The top- 10 most signifi- 
cantly enriched pathways were considered. The common 
pathways identified by each method in each of the datasets 
were tabulated. A power of 1 was used for the weighted 
enrichments, and a p-value cut-off of 0.01 was used for 
the unweighted enrichment. Figure 2 shows all of the 
KEGG pathways that were detected by hypergeometric 
enrichment in at least 2 datasets. Only, the ribosome, Par- 
kinson's disease, oxidative phosphorylation, and TCA 
cycle pathways were identified in 2 different datasets. 



Figure 3 shows the KEGG pathways identified in at least 2 
or more datasets by the chi-squared test. Only the Parkin- 
son's disease and oxidative phosphorylation pathways were 
identified by both chi-squared and hypergeometric enrich- 
ment. Figure 4 shows the results of applying weighted 
hypergeometric enrichment with a power of 1. Weighted 
hypergeometric enrichment is able to detect pathways 
most consistently. The lysine biosynthesis pathway is sig- 
nificantly enriched in all datasets. Unlike the other enrich- 
ment methods, weighted hypergeometric enrichment 
identified completely different pathways consisting mainly 
of metabolic pathways, and pathways relating to amino 
acids. Figure 5 contains the results of weighted chi- 
squared enrichment. Parkinson's disease and oxidative 
phosphorylation are both consistently enriched by the 
weighted chi-squared method. 

Weighted hypergeometric enrichment detected the 
butirosin and neomycin biosynthesis pathway in 3 data- 
sets, which is shown in Figure 6. This pathway was not 
enriched by any of the other methods in 2 or more data- 
sets. There is evidence for action by neomycin on the ner- 
vous system; it can block the capsaicin response of rat 
dorsal root ganglion neurons and can block N-type and 
P-type voltage dependent calcium channels [36]. Neomycin 
may also be a transient receptor ion channel 1 (TRPV1) 
antagonist. TRPV1 is a ligand-gated cation channel 
involved in multiple pain sensation mechanisms, as a 
result neomcycin can alleviate pain responses [37]. Fatty 




0 0.5 1 1.5 2 

# of datasets 

Figure 2 A compilation of the pathways selected by hypergeometric enrichment from the rodent neuropathic pain model datasets. Only 
pathways enriched in 2 or more datasets are shown. 
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chi2 



Olfactory transduction 
Metabolic pathways 
Ribosome 
Pathways in cancer 
Neuroactiue ligand-receptor interaction 

Hepatitis C 

Cytokine-cytokine receptor interaction 




0 



# of datasets 



Figure 3 A compilation of the pathways selected by chi-squared enrichment from the rodent neuropathic pain model datasets. Only pathways 
enriched in 2 or more datasets are shown. 



acid biosynthesis is enriched in two datasets, and fatty acid 
metabolites can induce pain by stimulating the transient 
receptor potential Al channel (TRPA1) [38]. The fact that 
these two pathways, which have been previously associated 
with neuropathic pain, are only identified by weighted 
hypergeometric enrichment demonstrates the potential 
advantage of weighted hypergeometric enrichment in 
identifying relevant pathways missed by the standard 
methods. 

Discussion 

Weighted hypergeometric and chi-squared enrichment 
extend over-representation analysis to include change in 
expression of the genes and include all genes instead of a 
pre-selected list. These approaches enable every gene to 
contribute to the enrichment in proportion to their fold- 
change. Changing the power parameter enables one to 
adjust how much the expression change of the genes 



contributes to the enrichment score of the pathway. Our 
approach combines ORA and FCS methodologies. Unlike 
GSEA [15] our methods can detect pathways comprised of 
both up and down regulated genes by means of the score 
calculated for each gene. This is because we consider only 
the magnitude of expression change and not its direction 
with our score. There already exists a modification of 
GSEA that allows enrichment of pathways with bidirec- 
tional gene expression [39] . 

Weighted enrichment methods are much more conser- 
vative than unweighted methods. Because the weighted 
hypergeometric enrichment methods are so conservative, 
they produce no significant results when corrected for 
multiple comparisons. The Benjamini-Hochberg false dis- 
covery correction [40] was applied to the weighted enrich- 
ment, and the results are depicted in Tables 2, 3, 4. Table 
2 shows that after false discovery rate (FDR) correction 
there are no significant pathways (p < 0.05), and that each 
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whyp 



Lysine biosynthesis 
D-Glutamine and D-glutamate metabolism 
D-Arginine and D-ornithine metabolism 
Cyanoamino acid metabolism 
Sulfur relay system 
Non-homologous end-joining 
Glycosaminoglycan biosynthesis - keratan sulfate 
Glycosaminoglycan biosynthesis - chondroitin sulfate 
Butirosin and neomycin biosynthesis 
Biotin metabolism 
Fatty acid biosynthesis 



# of datasets 

Figure 4 A compilation of the pathways selected by weighted hypergeometric enrichment from the rodent neuropathic pain model datasets. 
Only pathways enriched in 2 or more datasets are shown. 




pathway has an FDR-corrected p-value of 1. However, the 
FDR correction is not suitable for application to enrich- 
ment analysis because FDR has a high variability and 
should be applied to a larger number of p-values than 
those generated by enrichment [41]. Furthermore, it has 
been shown that most multiple comparison corrections 
decrease the power of the analysis and are also too conser- 
vative [13,42]. Additionally, the p-values resulting from 
enrichment analyses can be fragile and sensitive to non- 
statistical aspects of their calculation such as the data 
sources or the mapping of gene names between different 
conventions; these issues cannot be resolved by correction 
for multiple comparisons [13]. Huang et al. advise using 
prior biological knowledge to assess the enriched path- 
ways, and that the results of enrichment should only be 
guidelines for an investigator [13]. Thus, FDR values were 
only included to be thorough when describing the results 
of these methods. We advise considering only the top- 10 



pathways instead of multiple comparison correction. 
Furthermore, we evaluated the consistency of our methods 
by considering the top-10 pathways enriched in data from 
several similar experiments. We were able to demonstrate 
that weighted hypergeometric enrichment produced the 
most consistent results. 

Validating the weighted enrichment methods has 
proved to be challenging because there is no ground 
truth to compare the enriched pathways against. As a 
result, validation of the methods was performed based 
on literature search, which is not a complete or objec- 
tive analysis. Literature search-based validation is biased 
towards already known pathways. There is no way of 
knowing whether pathways enriched by the dataset that 
have not been previously identified in the literature are 
actually associated with the disease or are falsely identi- 
fied as enriched. Furthermore, our method is still sensi- 
tive to the handling of gene identifier mapping. Another 
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Figure 6 The butirosin and neomycin biosynthesis pathway. The nodes identified in the datasets are colored in pink. The node 2.7.1 1 
corresponds to 3 different rat genes that were identified as significant. 
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drawback of this methodology is that it ignores the 
topology of the pathways. It is possible, for example, 
that an increase in the expression of a gene could be 
canceled out by a decrease in the expression of a down- 
stream gene that is up regulated by the first gene. There 
is no way to address this situation when using our 
method. However, this drawback does not apply to the 
enrichment of Gene Ontology terms, which are arranged 
hierarchically. 

We have proposed weighted hypergeometric and chi- 
squared methods to enrich gene sets. These methods can 
produce more biologically relevant results for KEGG path- 
way enrichment than the standard hypergeometric 
approach, despite the fact that the problem of Type II 
errors is inadequately addressed by correcting for multiple 
comparisons. We also showed that our method tends to 
produce more consistent results when using data from 
similar experiments. Despite only showing the results of 
KEGG pathway enrichment, these methods can also be 
applied to the Gene Ontology classifications as well as any 
other set of genes. 
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